A fast and specific alignment method for minisatellite maps.

BACKGROUND
Variable minisatellites count among the most polymorphic markers of eukaryotic and prokaryotic genomes. This variability can affect gene coding regions, like in the prion protein gene, or gene regulation regions, like for the cystatin B gene, and be associated or implicated in diseases: the Creutzfeld-Jakob disease and the myoclonus epilepsy type 1, for our examples. When it affects neutrally evolving regions, the polymorphism in length (i.e., in number of copies) of minisatellites proved useful in population genetics.


MOTIVATION
In these tandem repeat sequences, different mutational mechanisms let the number of copies, as well as the copies themselves, vary. Especially, the interspersion of events of tandem duplication/contraction and of punctual mutation makes the succession of variant repeats much more informative than the sole allele length. To exploit this information requires the ability to align minisatellite alleles by accounting for both punctual mutations and tandem duplications.


RESULTS
We propose a minisatellite maps alignment program that improves on previous solutions. Our new program is faster, simpler, considers an extended evolutionary model, and is available to the community. We test it on the data set of 609 alleles of the MSY1 (DYF155S1) human minisatellite and confirm its ability to recover known evolutionary signals. Our experiments highlight that the informativeness of minisatellites resides in their length and composition polymorphisms. Exploiting both simultaneously is critical to unravel the implications of variable minisatellites in the control of gene expression and diseases.


Polymorphic tandem repeats
Polymorphic tandem repeat loci, also known as Variable Number Tandem Repeats (VNTR), are widely used as genetic markers in population genetics, gene mapping, and forensic medicine (Jeffreys et al. 1985). Microsatellites, which are tandem repeats of a 1-6 bp long motif, show a frequent variability in their number of repeats.The expansion in some triplet microsatellites forms the molecular basis of a dozen inherited neurodegenerative diseases (Cummings and Zoghbi, 2000). Polymorphism was observed in another class of tandem repeats with motif size in the 7-100 bp range, the minisatellites. Unlike microsatellites, unstable minisatellites display not only repeat number variability, but also sequence heterogeneity between repeats. The succession of the variant repeats along a minisatellite allele can be obtained by a specific method called Minisatellite Variant Repeat-PCR (Jeffreys et al. 1991) (MVR-PCR). It provides a MVR map: a string of symbols in which each symbol represents a different variant of the minisatellite repeat unit. The correspondence between the minisatellite sequence and the map is illustrated below. inter-allelic rearrangements happen at much lower frequency (10 −5 ) . In other eukaryotic species, no hypervariable minisatellites were discovered (Bois and Jeffreys, 1999). Detailed investigation of murine minisatellites provides evidence of a variability dominated again by simple intra-allelic duplication occuring at a rate of 10 −4 per gamete (Bois et al. 1998). This suggests that these minisatellites can serve for evolutionary population studies. Indeed, alignments between MVR maps have also been used to deduce the evolutionary relationships between alleles for the study of recent human population history (Alonso and Armour, 1998;Armour et al. 1993;Hurles et al. 1998;Stead and Jeffreys, 2002). Both the potential investigative power of variable minisatellites for evolutionary studies and their use for identification has been limited by the lack of computerized methods to objectively compare alleles.
The evolutionary events at work in minisatellite turnover are divided into inter-and intra-allelic events. Inter-allelic events meanrearrangements between the two alleles of an autosomal minisatellite, while intra-allelic events comprise the amplification and contraction of a repeat or of a block of consecutive repeats, as well as the nucleotidic mutations inside the repeats. For the acquisition of MVR maps, the limits of the variant repeats are chosen arbitrarily, and when comparing maps, duplication events are assumed to copy complete variants (and not, for example, a variant and the half of the following variant). However, the mechanisms of DNA duplication may duplicate any segment of DNA inside the minisatellite, and their templates do not always correspond to complete repeats. Therefore, comparison of minisatellite maps relies on the assumption that the boundaries of the variant repeats are fixed and that duplications copy complete variants. This assumption (discussed from the algorithmic view-point in (Benson and Dong, 1999) and (Rivals, 2004)) may not be satisfied for all minisatellites, but seems generally valid for polymorphic tandem repeat loci.
Variable minisatellites were also shown to be involved in the development of inherited diseases: either by influencing gene transcription, like in the progressive myoclonus epilepsy type 1, or by being part of a coding sequence, like in the prion protein gene, which is responsible for the Creutzfeld-Jakob disease (Buard and Jeffreys, 1997;Bois and Jeffreys, 1999). The informativeness of unstable minisatellites has led to their widespread use for individual identification, parentage analysis (Jeffreys et al. 1985), and for discrimination between bacterial strains, including anthrax strains (Le Flèche et al. 2001).

Minisatellite variability and turnover processes
Comparison between the internal structure of alleles has been shown to be the key to elucidate the mechanism of minisatellite expansion and deletion for several human autosomal GC-rich minisatellite loci. Complex rearrangements involving transfer of groups of repeats between alleles as well as intra-allelic duplications have been deduced by alignment "by eye" between MVR maps of progenitor and new length alleles (Buard and Vergnaud, 1994;Jeffreys et al. 1994;May et al. 1996;Tamaki et al. 1999). However, as the rearrangements may completely reshuffl e the repeat array in a single generation (Tamaki et al. 1999), the parental relationships is easily lost when studying more distantly related alleles (e.g., from different populations). This renders unstable minisatellites showing a significant fraction of mutations as highly complex rearrangements involving interallelic exchanges inadequate tools for populationwide evolutionary studies. Among those minisatellites are hypervariable GC-rich autosomal loci in human.
Many variable minisatellites with mutation rate below 1% per gamete have been reported in human and in other species. For instance, in the human insulin minisatellite, the variation is mainly due to the gain or loss of one repeat, which occurs at a rate of 10 −3 per gamete, while complex 1.3. Existing algorithms for minisatellite comparison Several methods to compare MVR maps were published recently. All of them consider solely intra-allelic evolution. The statistical similarity measure defined in (Brion et al. 2002) computes a weighted sum of the number of shared variants when the two maps are compared at different relative positions. This measure depends to a great extent on the weight function used; in addition, the distance based on it is not a metric, which is a serious drawback for phylogenetic reconstruction. Alignment of minisatellite maps under a specific evolutionary model that considers indels, substitutions, but also tandem duplications and contractions of variants was first described in (Bérard and Rivals, 2003). There, as well as in (Behzadi and Steyaert, 2004;Behzadi and Steyaert, 2005), duplications and contractions are limited to a single variant (e.g., abc → abbc); in other words, the duplication of a block of consecutive variants (e.g., abcd → abcbcd ) is not allowed as a single event. Compared to classical sequence alignment, the result of a series of events on a map is order dependent (e.g., duplication + substitution ≠ substitution + duplication), which makes the computation more complex. In (Bérard and Rivals, 2003), the proposed alignment procedure accounts for these dependencies and computes an optimal alignment under a model where all mutations have the same cost.
Other works aim at improving the effi ciency of this algorithm. A first method performs a Run-Length Encoding of the maps to reduce the complexity of the procedure (Behzadi and Steyaert, 2004). Run-Length Encoding (RLE) is a data compression technique in which a stretch of the same letter is coded as a power of this letter, e.g., a 3 instead of aaa. The algorithm of (Behzadi and Steyaert, 2005), performs the computation of the alignment distance in cubic time using a model that allows for variable mutation costs, i.e., costs that depend on the variants involved. However, unlike in (Bérard and Rivals, 2003), even if the elementary costs are symmetrical, the distance computed is not symmetrical, which prevents it from being a metric. Recently, an algorithm that accounts for duplication and excision of blocks in the alignment of DNA sequences has been published (Sammeth et al. 2005). Here, a more complex evolutionary model is considered as a block of several variants may be duplicated in a single event; this explains why the algorithm complexity is exponential. In addition, none of these methods, (Behzadi and Steyaert, 2004), (Behzadi and Steyaert, 2005), (Sammeth et al. 2005), is available to the biological community. Thus, we designed and implemented an exact algorithm to align maps that uses RLE for effi ciency and computes an alignment distance that is a metric. Our mutational model authorizes variable mutation costs as in (Behzadi and Steyaert, 2005).

Biological validation
To validate our algorithm, we choose a data set from a minisatellite on the human paternally inherited Y chromosome, called MSY1. Most of the Y chromosome is haploid and thus escapes recombination. It contains the whole range of polymorphic systems observed in the human genome with mutation rates varying from 5.10 −7 % for base substitutions to 3.8% for the hypervariable minisatellite MSY1 (DYF155S1) (Andreassen et al. 2002). Probably due to its obligatory intra-allelic mode of mutation (no complex rearrangement between alleles) and to its AT-richness, MSY1 evolution is dominated by the gain or loss of a single variant (Andreassen et al. 2002), and is thus adequate to the model we hypothesized in this work. As a result, MSY1 alleles have a simple modular structure of variant repeats compared with the intermingled pattern of variants resulting from recombinational exchanges at hypervariable autosomal minisatellites (Figure 1 and also ). Homogenization of the variants along the array has been observed at the locus MSY1, suggesting the occurrence of complex rearrangements like at autosomal loci. However, this phenomenon is restricted to alleles belonging to a single Y haplogroup .
MSY1 haploid nature also avoids physical separation of the two alleles, which is a technical bottleneck for autosomal minisatellites. Consequently, a large number of MSY1 alleles originating from males all over the world were typed by several authors. The Y chromosome represents a unique system for comparing inter-chromosome relationships established with MVR maps and those deduced from more stable markers. (Jobling and Tyler-Smith, 2000), studied the evolutionary relationships between stable markers on the Y chromosome. They defined haplogroups using these markers and reconstructed a most parsimonious tree for them. The availability of known precise relationships between a large set of alleles is a major reason for the choice of MSY1.
The purpose of our experiments was to investigate whether known phylogenetic relationships between Y chromosomes could be independently recovered from the alignments of MSY1 MVR maps. Moreover, as the MSY1 MVR maps were obtained from individuals taken from different populations, we could check whether the coalescence of MSY1 haplotypes we inferred inside a Y chromosomal haplogroup agrees with the populations' relationships. We analyzed 609 alleles of MSY1 with our program and found inter-and intra-haplogroup relationships that are consistent with the evolution of the Y chromosome.

Article overview
The remaining of this paper is organized as follows. In Section 2, we provide a notation and define an evolutionary model for minisatellites. Section 3 explains the alignment algorithm. Section 4 describes the experiments performed on 609 alleles of the MSY1 human haploid minisatellite to validate our program MS _ ALIGN. We conclude in Section 5.

Notations
Let Σ be a finite alphabet of variants. A map s is a string of |s| characters of Σ indexed from 1 to |s|.

Evolutionary model
Tandem repeat sequences undergo two kinds of evolutionary events: point mutations (substitutions and indels) that act on nucleotides and modify the minisatellites variants, and specific events acting on a complete variant: tandem amplification and its opposite, tandem contraction. The change of a variant into another, be it caused by a nucleotidic substitution or by an indel, is called a mutation in the sequel. An amplification duplicates a variant and put the generated copy at its side. Our evolutionary model also takes into account the events of insertion and deletion of one variant. Therefore, it contains five evolutionary events on variants: mutation of a variant into another, insertion, deletion, amplification, and contraction. We take into account amplifications and contractions of one variant, i.e., producing/deleting only one copy at a time; these are the most frequent events at MSY1 (Andreassen et al. 2002).  The columns indicate the haplotype code, his population of origin, his haplogroup, and the MSY1 map . For space reasons, the maps of the alleles m110 and m125 have been shortened by replacing a block of consecutive type 3 or type 4 variants by three dots. These five operations are gathered under the term elementary operations as they involve one variant at a time. A positive cost is associated to each operation. A succession of events that transforms s into r is called an alignment between s and r. The alignment cost is the sum of the costs of the operations it contains.We denote each cost by the uppercase initial of the corresponding operation: I for insertion, D for deletion, A for amplification and C for contraction. The mutation cost depends on the variants. We denote As the frequency of amplifications and contractions on minisatellite maps is higher than the frequency of other operations, these two events have a lower cost: The alignment cost under this model is a metric (see the proof of (Bérard and Rivals, 2003)). It matters when using the cost as a distance, for example in phylogenetic reconstruction.

Segmental operations
The set of operations of our evolutionary model induces "long distance" dependencies. As an example, let us consider the following two generations of sequence aba from character a: .

⎯ → ⎯⎯
The first one contains 2 amplifications and 1 mutation though the second one contains 2 amplifications and 2 mutations. In the first generation, the last character a of aba appeared before the b and was produced by an amplification of a variant a, which is no longer at its side. This example illustrates the non-commutativity of operations, i.e., that the order in which the operations are applied matters. A procedure that would consider the variants independently from left to right, as the second generation scheme, cannot always find the optimal alignment. To handle such cases, we define two supplementary operations: Generation of a substring and Compression of a substring. These operations enable to align straightly one variant with a whole substring, taking into account the optimal application order of elementary operations. Generation and compression of a substring are symmetrical and gathered under the term segmental operations.
Example 3: An illustration of segmental operations with the alphabet of variants being a, b, c, d is shown on Figure 4. From the rightmost occurrence of b in the upper map, we generate the substring bbcaccbb in the lower map. In a generation, any symbol of the generated substring (in the lower map) is an offspring of the source symbol (in the upper map). The sequence of operations is represented as a tree whose root is the b in the upper map, and whose leaves ordered from left to right are the characters of the generated substring. If we consider the tree bottom-up, then it represents the compression of the substring bbcaccbb into the symbol b.

Algorithm
Our alignment algorithm is composed of two phases: 1. Computation of the costs of all possible segmental operations in each map; 2. Alignment of the 2 maps by dynamic programming, taking into account both the elementary and the segmental operations computed at step 1.

First phase
optimal sequence of operations that generates s[i.. j] from the symbol a, which does not start by two successive mutations. Indeed, if it starts by changing a into b, and then changing b into c, then mutating directly a into c would not cost more.
Using the two tables S s and C s allows to avoid computing such non-optimal generations, and improves the time complexity. In S s , only the cells located strictly above the main diagonal are used (i.e., such that i < j). The recurrences to fill jointly tables C s and S s are: (a 1 ) Duplicate the symbol a, which yields aa, and then generate the complementary prefix and suffi x from each a, which is accounted for by summing C s (a, i, k) and C s (a, k + 1, j). (a 2 ) Generate the prefix s[i..k] from symbol a, and generate the suffi x s[k + 1..j] from ε, (a 3 ) or symmetrically, generate the prefix from ε, and generate the suffi x from symbol a.
For each pair (i, j), corresponding to the substring s[i..j], and each letter a, we first compute S s (a, i, j) by Equation (a), and then C s (a, i, j) by equations (b) and (c). The initialization computes the generation costs of all one-character substrings, i.e., for all pairs (i, j) such that i = j. In this case, the segmental operation corresponds to an elementary operation: either an insertion if we start from ε (C s (ε, i, i) = I ), or a mutation if we start from a symbol (C s (a, The main recurrence is decomposed in 3 cases, it uses the intermediate This preprocessing is performed on both s and r to obtain the tables C s and C r in time O(p 3 |Σ|), with p = max(n, m). Yet, one can speed up the preprocessing by exploiting the repetitiveness of the sequences, and by using instead of sequence s, its Run-Length Encoded (RLE) version denoted s u. E.g., if s = aaaabbcccaaa, then s u = a 4 b 2 c 3 a 3 and, |s u| = 4 (though |s| = 12). The precomputing of C s is illustrated in Figure 2

Second phase
To align globally maps s and r, we construct a dynamic programming matrix A, indexed on

Improvements over the algorithm of (Bérard and Rivals, 2003)
We provide a novel program (MS_ALIGN Version 2) to compare minisatellite maps. The set of events authorized comprises all elementary events that occur in the intra-allelic evolution of a minisatellite. The evolutionary model has been extended to account for variable mutation costs. Note that the model can be extended to enable all costs to be variant dependent. In the original model M(a, b) was identical regardless of a, b ∈ Σ such that a ≠ b. Second, we have generalized "arches" operations to generation and compression of all substrings, which yields a simpler formulation of the algorithm. Finally, by using RLE maps we reduced the running time, although the worst-case complexity is higher because of the model. Thus, the algorithm is now faster and so more exploitable. Figure 4 displays an alignment resulting from our algorithm. This figure shows at the same time the alignment and the scenarios associated to the segmental operations.

Further direction for algorithmical refinements
We also envisaged reducing the complexity of our algorithm O(p 3 |Σ| + p 3 ) to O(p 3 |Σ|). This time complexity could be achieved if the 2nd phase was done on RLE sequences. This appears to be difficult. Indeed, it would only be possible if the blocks do not overlap in an optimal alignment between two maps. However, in the counter-example shown b e l o w , w h e r e s = a 9 9 0 b 1 a 1 0 a n d r = a 10 b 1 a 990 , the unique optimal alignment 1 costs 2 × M(a, b), and, it contains two large a blocks overlapping each other.
Nevertheless, in practice MS_ALIGN can compare several hundred maps (the size of the largest available data sets) in a reasonable time, which enables the user to test the robustness of the alignments by varying the event costs.

Implementation
Our algorithm is implemented in a program called MS_ALIGN, which can be run through a web interface. For more details, we refer the reader to the section User's guide of the website http://atgc. lirmm.fr/ms_align/. The distances matrix output by MS_ALIGN can be directly used as input of standard phylogenetic reconstruction programs.

Biological validation on MSY1
In Section 3, we proposed an algorithm to align minisatellite maps under the evolutionary model detailed in Section 2. An optimal alignment of two maps represents a series of mutations needed to transform one map into the other. The alignment score can serve as a weighted measure of the evolutionary distance between these two maps. Thus, we wanted to know whether this measure is adequate to infer evolutionary relationships between the haplotypes represented by these maps.To this purpose, we tested our program on real biological data. We chose a large set of 609 MSY1 alleles of men originating from 76 different populations. The main reason for this choice resides in the availability of known evolutionary relationships for the Y chromosomes, and the knowledge of the Y-chromosomal haplogroup of each individual of the data set. Indeed, human Y chromosomal haplogroups have been defined from the analysis of more stable Y marker systems, mainly SNPs, and an evolutionary tree for these haplogroups have been proposed (Jobling and Tyler-Smith, 2000). Therefore, we used MS_ALIGN to compute metric distances between pairs of alleles and infered phylogenetic trees from these distances using BIONJ . We could then investigate whether known phylogenetic relationships between Y chromosomes could be independently recovered from alignments between MSY1 MVR maps.

Data set
Mark Jobling provided us with the MSY1 maps of 690 individuals originating from 76 populations. Figure 1 displays examples of MSY1 maps. The MSY1 variants are 25 bases long and 7 different variants have been identified. Besides these vari-ants, some maps display unidentified variants, which are termed null repeat. For example, this null repeat occurs in the alleles m47, m82, m121, m6, and m715 of Figure 1. We excluded the maps that contain more than 3 adjacent null repeats; this yielded a test set of 609 maps, with an average length of 70 repeats and less than 1 null repeat in average. On the Y chromosome, 27 haplogroups were defined using SNPs and other stable markers, and a most parsimonious tree was reconstructed for them (Jobling and Tyler-Smith, 2000). The MVR maps are identified by a code, their population of origin, and their haplogroup number.
Although a new nomenclature of Y chromosomal haplogroups has been published recently (The Y Chromosome Consortium, 2002), we use the nomenclature of (Jobling and Tyler-Smith, 2000) in the sequel. The new nomenclature has defined a larger number of haplogroups and inferred an evolutionary tree of their relationships using SNP data. However, the correspondence between the old and new nomenclatures is complex: for instance, the old haplogroup 2 is now split between clades B, G, and I of the new haplogroup tree.Without further information, it is not possible to determine to which of the new haplogroups an individual in our data set belongs, nor to compare our results to the haplogroup tree of new haplogroups. However, the data set based on the nomenclature of (Jobling and Tyler-Smith, 2000) is relevant for validation purposes.
The costs we used are A = C = 1, D = I = 40 and M(a, b) = 10 ×d H (a, b), where d H (a, b) is the number of nucleotides that differ from variant a to variant b, i.e., the Hamming distance between a and b. For MSY1, d H (a, b) ≤ 3 ∀ a, b ∈ Σ. Since the introduction of a new letter can be done either by an insertion or by an amplification followed by a mutation, the value of I (and so D) is unim- M(a, b)) ∀ a, b ∈ Σ, which is true for our application. The ratio M(a, b)/A is important and we tried several ratios in our experiments. The best results are obtained for the values given above, in which the ratio corresponds approximatively to the value −log(mut. frequency/ amp. frequency).

Haplogroup prediction with MSY1 maps
Using the matrix of pairwise distances between individuals and the k-nearest-neighbors method (Gordon, 1999), we predicted the correct haplogroup 80% of the time for k = 3 to 5 neighbors. Moreover, the percentage of time the correct haplogroup is within the 3 most represented haplogroups is about 93%. Always predicting the most probable class leads to a classification rate of 21.5%. In the same way, randomly predicting the class of the allele using the prior distribution of classes leads to a classification rate of 11.7%. This last rate is calculated as Σ i (|C i |/ Σ j |C j |) 2 , where |C i | denotes the cardinality of the class C i . The prediction based on the alignment distance between maps outperforms a random prediction, revealing the relation between the sequence of variants of an individual and its haplogroup.

An evolutionary tree of the Y-chromosome haplogroups derived from MSY1
From the matrix of distances between the individuals, we compute the matrix of average distances between the haplogroups.We then let BIONJ  reconstruct an evolutionary unrooted tree for the haplogroups with these distances as input. This tree is shown in Figure 5(a). We only include haplogroups for which at least 5 maps are available (1-4, 8-12, 15, 16, 18, 21, 22, 24, 26). A most parsimonious tree of the Y-chromosome haplogroups based on substitutional polymorphisms was proposed (Jobling and Tyler-Smith, 2000). A modified version of this tree, called SNP tree, appears on Figure 5(b). The modification consists in, first, restricting the tree to the haplogroups that also belong to our tree, and second, creating additional leaves for the haplogroups that label internal nodes in the original tree. This tree contains several polytomies, which prevent the direct comparison with our tree. Hence, we compute the number of leaves that should be removed from both trees in order to get maximal compatible subtrees (Page and Holmes, 1998), i.e., the same subtrees except that a polytomy in the SNP tree can be replaced by a binary subtree in the other tree. For this, 4 removals are suffi cient: haplogroups 4, 18, either 12 or 16, and either 21 or 8. This value of 4 leaves out of 16 shows how strongly the trees are related. The resulting compatible subtrees obtained with these removals can be seen as supplementary material.
In conclusion, the MSY1 tree agrees with the SNP tree for the most recent levels of evolution, which is consistent with the hypervariability of MSY1.

Internal evolution in two large populations
We constructed two trees for the Finnish and the Mongolian populations (Figures 6 and 7). A remarkable feature is that subsets of maps of the same haplogroup cluster together. Clearly, the structures of both trees reflect the relationships of the haplogroups in the MSY1 haplogroups tree, although the latter is based on average haplogroups distances. Hence, when the trees are estimated from the raw distances, they corroborate the partition in haplogroups and the relationships between these haplogroups. Figure 5. (a) The haplogroup tree infered from inter-haplogroup average distances between MSY1 alleles. We compared all MSY1 alleles pairwise with our algorithm and computed interhaplogroup average distances. We reconstructed an evolutionary tree from these distances using the distance based phylogenetic reconstruction program, BIONJ . (b) The haplogroup tree reconstructed from other stable markers (termed SNP tree) modified from (Jobling and Tyler-Smith, 2000). In both trees, the haplogroups are denoted by their number following the nomenclature of (Jobling and Tyler-Smith, 2000).

Evolutionary relationships within the haplogroups
The Y chromosomes tend to be more closely related to one another inside a population than autosomal chromosomes. This and the difference in behavior between gender result in geographical specificity of the Y chromosomes. By computing evolutionary trees for the haplotypes within a haplogroup, we could check whether the alignment of MVR maps is able to recover a geographical clustering of the haplotypes.
The tree of haplogroup 16 (Figure 8) contains 57 MSY1 haplotypes, among which 3 populations are represented by several individuals: Mongolian (23), Finnish (10), Yakut and Siberian Yakut (13 + 5). In the tree, the Yakuts are monophyletic, all Finns but two form a monophyletic group, and the Mongolians agglomerate together (with one Japanese, two Finns, two Russians, and two Norwegians) and branch out between the Finnish and Yakut subtrees. Here, for haplogroup 16, whose number of haplotypes is much larger than haplogroup 4, we observe a high level of population specific coalescence. Apart from a few individuals in the Mongolian group, geographical separation appears clearly in this tree and agrees well with the geographical specificity of the Y.

Conclusion
Here, we presented a novel method for the alignment of minisatellite maps, which considers an extended evolutionary model with variable mutation costs. It improves in simplicity and in computational time upon previous solutions. Moreover, the program MS_ALIGN can be used through a web-interface and is available upon request from the authors. We have applied our method on a large real data set from the human haploid hypervariable minisatellite MSY1. The alignment distance enables us to recover known phylogenetic relationships between Y-chromosomal haplogroups, showing the validity of the approach. In tentative experiments, we investigate the coalescence of alleles within haplogroups, and the outcome suggest the method could prove useful for microevolutionary studies. Our results highlight that the informativeness of minisatellites resides in their length and composition polymorphisms, which can both be exploited simultaneously.
MS_ALIGN can be also used to analyze other types of tandem repeats.
To further validate our program for a wider range of minisatellite, we tested it with the variable GC-rich autosomal insuline minisatellite (INS). A study of the structural diversity of INS alleles could assign the alleles in three lineages called classes I, IIIA, and IIIB. Visual inspection and multidimensional scaling further divided class I into classes IC and ID . In an experiment similar to those performed with MSY1 data, we compared the set of 181 INS alleles published in  with MS_ALIGN, and reconstructed a coalescence of these alleles from the resulting alignment distances (tree available as Supplementary Material). In this tree, the classes IC, ID, IIIA, and IIIB are all monophyletic, and the main split separates classes IC/ID from classes IIIA/IIIB. Again, by comparing the alleles of a variable GC-rich minisatellite, our alignment tool could infer automatically the distinct classes of alleles. This suggests that MS_ALIGN could be well suited for deciphering the evolution of unstable, but non hypervariable, minisatellites.
Tandemly repeated protein sequences are also amenable to analysis. In an other work, our program was succesfully applied to decipher the evolution of a large family of proteins that contain a variable tandem repeat in the N-terminal parts of their sequences, the Pentatricopeptide Repeat family in Arabidopsis thaliana (Rivals et al. 2006).
A limitation of MS_ALIGN for minisatellite analysis is the restriction on duplication and contraction to operate on a single variant, and not on a block of consecutive variants. In cases of block duplications, MS_ALIGN overestimates the allele distance. In both MSY1 and INS, some alleles provide evidence for block duplications (i.e., presence of a repeated block of several variants). However, this did not prevent the inference of correct allele relationships from the distances computed by MS_ALIGN. Two reasons may explain this. First, both at MSY1 and INS loci the frequency of such events remains limited compared to single variant duplications and contractions Jeffreys, 2000, Andreassen et al. 2002). Second, as a consequence of the preponderance of single variant duplications, duplicated or contracted blocks may often be themselves a stretch of a single variant with few or no mutations. Thus, the overestimation made by MS_ALIGN tends to approximate well the real distance.
To authorize block duplications/contractions in minisatellite alignment makes the evolutionary model more general and even more realistic, but increases the complexity of the alignment procedure (as in (Sammeth et al. 2005)). A major challenge for future developments will be to generalize the evolutionary model (e.g., taking into account inter-allelic exchanges) and to design a pairwise or a multiple alignment algorithm that remains effi cient in practice. Figure 1. The Maximum Compatible Subtrees obtained by deleting leaves 4, 18, 12, 21 from the MSY1 haplogroups tree (left) and the SNP tree (right). When these deletions are applied to the two trees, the remaining subtrees contain the same leaves, but are not identical due to the polytomies in the SNP tree. This shows the similarity of the haplogroups tree of Figure 5 (in the article). . Phylogenetic tree of haplogroup 4. Each allele is given by its identifier and its population of origin. We compared the MSY1 alleles of this haplogroup pairwise with our algorithm, MS_ALIGN(version 2), and used the resulting pairwise distance matrix to infer an evolutionary tree for the alleles using a Neighbor-Joining phylogenetic reconstruction program, BIONJ . This tree perfectly separates 10 Japaneses from 3 Tibetans and 1 Mongolian. The Japanese haplotypes coalesce in a single clade.